import pandas as pd
import numpy as np
import altair as alt
Diabetes Analysis
Data Reference: https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators
Summary
This project attempts to predict diabetes status using the Logistic Regression and LinearSVC models, against a baseline DummyClassifier on an imbalanced dataset. All models achieved similar accuracy on the test set (approximately 0.86), which highlights a key issue: accuracy alone is not a reliable performance metric.
These findings motivate deeper exploratory data analysis, evaluation with additional metrics (precision, recall, F1), and exploration of alternative models and threshold tuning to obtain a more robust assessment of the models’ predictive performance.
Introduction
Diabetes is a chronic disease that prevents the body from properly controlling blood sugar levels, which can lead to serious health problems including heart disease, vision loss, kidney disease, and limb amputation (Teboul, 2020). Given the severity of the disease, early detection can allow people to make lifestyle changes and receive treatment that can slow disease progression. We believe that machine learning models using survey data can offer a promising way to create accessible, cost-effective screening tools to identify high-risk individuals and support public health efforts.
Research Question
Can we use health indicators and lifestyle factors from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) survey to accurately predict whether an individual has diabetes?
We aim to:
1. Build and evaluate classification models that predict diabetes status from 21 health and lifestyle features
2. Compare the performance and efficiency of logistic regression and support vector machine (SVM) classifiers
3. Assess whether survey-based features can provide sufficiently accurate predictions for practical screening applications
Methods & Results
This analysis uses the diabetes_binary_health_indicators_BRFSS2015.csv dataset, a cleaned and preprocessed version of the CDC’s 2015 Behavioral Risk Factor Surveillance System (BRFSS) survey data, made available by Alex Teboul on Kaggle (Teboul, 2020).
For this analysis, we split the dataset into training (80%) and testing (20%) sets using a fixed random state (522) to ensure reproducibility. We implemented two classification algorithms:
- Logistic Regression: A linear model appropriate for binary classification that estimates the probability of diabetes based on a linear combination of features.
- Linear Support Vector Classifier (SVC): A classifier that finds an optimal hyperplane to separate diabetic from non-diabetic individuals.
Both models were implemented using scikit-learn pipelines that include feature standardization (StandardScaler) to normalize the numeric features to comparable scales. Binary categorical features were already processed in the dataset and were set to pass through the column transformer. We evaluated model performance using cross-validation on the training set and final accuracy assessment on the held-out test set.
Our results show that both models achieve approximately 86% accuracy, with logistic regression demonstrating slightly faster training time.
Discussion
The baseline DummyClassifier achieves an accuracy of about 0.86 by assigning the most frequent class (non-diabetic) to every patient. This reflects the fact that approximately 86% of the dataset is non-diabetic. Both Logistic Regression and LinearSVC achieve similar accuracy (approximately 0.86), offering little to no improvement over this baseline.
The EDA showed class imbalance (more non-diabetic than diabetic patients), which may affect the models’ reliability. Therefore, more analysis is needed: exploring additional models, evaluating with imbalance-aware metrics such as precision and recall, examining confusion matrices, and testing different data splits or tuning hyperparameters to determine whether performance is stable across scenarios before drawing strong conclusions.
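To make the accuracy pitfall concrete, here is a minimal sketch with synthetic labels mirroring the roughly 86/14 class split (not the actual model output): an all-majority predictor scores 86% accuracy while its recall on the diabetic class is zero.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Synthetic stand-in: 100 patients, 14 diabetic (mirroring the ~86/14 split)
y_true = np.array([0] * 86 + [1] * 14)
# A classifier that labels everyone non-diabetic...
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)                  # 0.86, same as the dummy baseline
rec = recall_score(y_true, y_pred, zero_division=0)   # 0.0: no diabetic case is found
prec = precision_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
cm = confusion_matrix(y_true, y_pred)                 # [[86, 0], [14, 0]]
print(acc, rec, prec, f1)
print(cm)
```

Recall and F1 expose the failure that accuracy hides, which is why they belong in the follow-up evaluation.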
The similarity in test scores is an unexpected finding. With a clean dataset containing informative and diverse features, we would expect the classification models to perform at least better than the dummy classifier. Additionally, initial hyperparameter tuning for logistic regression did not affect accuracy (data not shown). This finding highlights the importance of understanding the data through EDA to interpret where accuracy scores come from.
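For reference, the kind of tuning attempted resembles the sketch below: a grid search over the regularization strength C for logistic regression. The data here is synthetic (generated with `make_classification` under an assumed 86/14 weighting), so the numbers will not match the BRFSS results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced stand-in for the real features
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.86, 0.14], random_state=522)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # regularization strengths to try
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

When accuracy is saturated by the majority class, swapping `scoring="accuracy"` for `"f1"` or `"recall"` makes the grid search sensitive to minority-class performance.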
This suggests deeper EDA as a next step, including feature distributions, to see whether the classes overlap and whether the model can separate them effectively. Other future questions include determining which features are most important for classifying an individual as diabetic, evaluating the probability estimates, and assessing whether all features are truly helpful for drawing conclusions.
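One way to approach the feature-importance question is to inspect logistic regression coefficients fit on standardized features, where larger absolute coefficients indicate stronger influence. The sketch below uses a small synthetic table (column names borrowed from the dataset, values made up) in which the outcome is driven mostly by BMI.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(522)
# Toy data: only BMI actually drives the synthetic outcome
X = pd.DataFrame({
    "BMI": rng.normal(28, 6, 500),
    "GenHlth": rng.integers(1, 6, 500).astype(float),
    "HighBP": rng.integers(0, 2, 500).astype(float),
})
y = (X["BMI"] + rng.normal(0, 3, 500) > 30).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# Coefficients on standardized features, ranked by absolute magnitude
coefs = pd.Series(pipe[-1].coef_[0], index=X.columns).sort_values(key=abs, ascending=False)
print(coefs)
```

Because the features are standardized first, the coefficient magnitudes are comparable across features; here BMI dominates, as constructed.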
Analysis
Read in Data
from ucimlrepo import fetch_ucirepo
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)
dat = cdc_diabetes_health_indicators.data.original
Train Test Split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
train_df, test_df = train_test_split(dat, test_size=0.2, random_state=522)
X_train, y_train = (
train_df.drop(columns=["Diabetes_binary"]),
train_df["Diabetes_binary"],
)
X_test, y_test = (
test_df.drop(columns=["Diabetes_binary"]),
test_df["Diabetes_binary"],
)
train_df.head()
|   | ID | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 180125 | 180125 | 0 | 0 | 0 | 1 | 29 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 3 | 0 | 0 | 0 | 1 | 9 | 6 | 7 |
| 49393 | 49393 | 0 | 1 | 1 | 1 | 26 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 9 | 6 | 8 |
| 86115 | 86115 | 0 | 1 | 1 | 1 | 27 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 2 | 0 | 0 | 0 | 1 | 9 | 4 | 5 |
| 249968 | 249968 | 0 | 0 | 0 | 1 | 27 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 3 | 0 | 0 | 1 | 0 | 10 | 6 | 5 |
| 196362 | 196362 | 0 | 1 | 0 | 1 | 28 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 2 | 0 | 0 | 0 | 1 | 8 | 6 | 8 |
5 rows × 23 columns
train_df.tail()
|   | ID | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 135498 | 135498 | 0 | 0 | 0 | 1 | 23 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 2 | 0 | 0 | 1 | 6 | 6 | 8 |
| 143767 | 143767 | 0 | 1 | 1 | 1 | 28 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 2 | 0 | 0 | 0 | 1 | 10 | 4 | 6 |
| 68896 | 68896 | 0 | 0 | 0 | 1 | 28 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 3 | 0 | 0 | 0 | 1 | 7 | 4 | 5 |
| 247659 | 247659 | 0 | 0 | 0 | 0 | 31 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 5 | 4 | 8 |
| 61332 | 61332 | 0 | 1 | 0 | 1 | 30 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 3 | 0 | 0 | 0 | 1 | 4 | 6 | 8 |
5 rows × 23 columns
train_df.shape
(202944, 23)
train_df.columns
Index(['ID', 'Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits',
'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age',
'Education', 'Income'],
dtype='object')
Data Validation
import pointblank as pb
########################## Data Validation: Correct file format
## Checks that the training data has the expected number of columns,
## using the original dataset's column count as the reference
validation_1_1 = (
    pb.Validate(data=train_df)
    .col_count_match(len(dat.columns))
    .interrogate()
)
## Checks that the training data has the correct number of rows:
## an 80% split of the original dataset's observations.
rows, cols = dat.shape
train_target = int(rows * 0.8)
validation_1_2 = (
pb.Validate(data=train_df)
.row_count_match(train_target)
.interrogate()
)
validation_1_1
validation_1_2
(Pointblank output: row_count_match() step passed, 1/1 units.)
########################## Data Validation: Correct column names
### Check that data contains all required column names and matches the expected schema.
expected_columns = ['ID', 'Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits',
'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age',
'Education', 'Income']
validation_2 = (
pb.Validate(data = train_df)
.col_exists(columns = expected_columns)
.interrogate()
)
validation_2
(Pointblank output: all 23 col_exists() steps passed.)
########################## Data Validation: No empty observations
## Checks that all rows are complete and contain no missing values.
validation_3 = (
pb.Validate(data = train_df)
.rows_complete()
.interrogate()
)
validation_3
(Pointblank output: rows_complete() passed for all 203K training rows.)
########################## Data Validation: No missing values per column
## Checks that each column is 100% non-missing; the dataset is documented to contain no missing values.
threshold = 1  # warning threshold: flag as soon as a single missing value appears
validator = pb.Validate(data=train_df)
for col in train_df.columns:
    validator = validator.col_vals_not_null(columns=str(col), thresholds=threshold)
validation_4 = validator.interrogate()
validation_4
(Pointblank output: all 23 col_vals_not_null() steps passed, 203K rows each, with step-specific warning thresholds of W:1.)
numeric_features = ["BMI"]
binary_features = ["HighBP", "HighChol", "CholCheck", "Smoker", "Stroke",
"HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump",
"AnyHealthcare", "NoDocbcCost", "DiffWalk", "Sex"]
ordinal_features = ["GenHlth", "MentHlth", "PhysHlth", "Age", "Education", "Income"]
import pointblank as pb
########################## Data Validation: Correct data types in each column
################ If fails: Critical checks (schema) -> Let it fail naturally and stop the pipeline
schema_columns = [(col, "int64") for col in train_df.columns]
schema = pb.Schema(columns=schema_columns)
(
pb.Validate(data=train_df)
.col_schema_match(schema=schema)
.interrogate()
)
(Pointblank output: col_schema_match() step passed.)
########################## Data Validation: No duplicate observations
################ If fails: Non-Critical -> raise warnings and continue
unique_key_cols = ["ID"] # use only the primary key column "ID"
try:
    (
        pb.Validate(data=train_df)
        .rows_distinct(columns_subset=unique_key_cols)
        .interrogate()
    )
except Exception:
    print("Data Validation failed: duplicate observations detected")
########################## Data Validation: No outlier or anomalous values for NUMERIC Features
###### Through define acceptable numeric ranges
## (based on the data collection method and domain knowledge)
################ If fails: Non-Critical -> raise warnings and continue
try:
    (
        pb.Validate(data=train_df)
        .col_vals_between(columns="BMI", left=10, right=100)  # BMI is unlikely to fall below 10 or exceed 100
        .interrogate()
    )
except Exception:
    print("Data Validation failed: outlier or anomalous values detected")
################################## Checking the value ranges for ordinal features
for f in ordinal_features:
    temp_col = train_df[f]
    print(f"========================================== {f}")
    print(f"datatype: {temp_col.dtype}")
    print(temp_col.sort_values().value_counts().index)
========================================== GenHlth
datatype: int64
Index([2, 3, 1, 4, 5], dtype='int64', name='GenHlth')
========================================== MentHlth
datatype: int64
Index([ 0, 2, 30, 5, 1, 3, 10, 15, 4, 20, 7, 25, 14, 6, 8, 12, 28, 21,
29, 16, 9, 18, 27, 22, 17, 26, 11, 23, 13, 24, 19],
dtype='int64', name='MentHlth')
========================================== PhysHlth
datatype: int64
Index([ 0, 30, 2, 1, 3, 5, 10, 15, 7, 4, 20, 14, 25, 6, 8, 21, 12, 28,
29, 9, 18, 16, 17, 27, 24, 13, 11, 22, 26, 23, 19],
dtype='int64', name='PhysHlth')
========================================== Age
datatype: int64
Index([9, 10, 8, 7, 11, 6, 13, 5, 12, 4, 3, 2, 1], dtype='int64', name='Age')
========================================== Education
datatype: int64
Index([6, 5, 4, 3, 2, 1], dtype='int64', name='Education')
========================================== Income
datatype: int64
Index([8, 7, 6, 5, 4, 3, 2, 1], dtype='int64', name='Income')
########################## Data Validation: Correct category levels for Category/Ordinal Features
###### Through define acceptable value set or range
## (based on the data collection method and domain knowledge)
################ If fails: Non-Critical -> raise warnings and continue
try:
    (
        pb.Validate(data=train_df)
        .col_vals_in_set(columns=binary_features, set=[0, 1])                  # binary features: 0/1
        .col_vals_in_set(columns="GenHlth", set=list(range(1, 6)))             # scale of 1-5
        .col_vals_between(columns=["MentHlth", "PhysHlth"], left=0, right=30)  # number of days out of 30
        .col_vals_in_set(columns="Age", set=list(range(1, 14)))                # scale of 1-13
        .col_vals_in_set(columns="Education", set=list(range(1, 7)))           # scale of 1-6
        .col_vals_in_set(columns="Income", set=list(range(1, 9)))              # scale of 1-8
        .interrogate()
    )
except Exception:
    print("Data Validation failed: incorrect category levels detected")
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 202944 entries, 180125 to 61332
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 202944 non-null int64
1 Diabetes_binary 202944 non-null int64
2 HighBP 202944 non-null int64
3 HighChol 202944 non-null int64
4 CholCheck 202944 non-null int64
5 BMI 202944 non-null int64
6 Smoker 202944 non-null int64
7 Stroke 202944 non-null int64
8 HeartDiseaseorAttack 202944 non-null int64
9 PhysActivity 202944 non-null int64
10 Fruits 202944 non-null int64
11 Veggies 202944 non-null int64
12 HvyAlcoholConsump 202944 non-null int64
13 AnyHealthcare 202944 non-null int64
14 NoDocbcCost 202944 non-null int64
15 GenHlth 202944 non-null int64
16 MentHlth 202944 non-null int64
17 PhysHlth 202944 non-null int64
18 DiffWalk 202944 non-null int64
19 Sex 202944 non-null int64
20 Age 202944 non-null int64
21 Education 202944 non-null int64
22 Income 202944 non-null int64
dtypes: int64(23)
memory usage: 37.2 MB
from deepchecks.tabular import Dataset, Suite
deep_train = Dataset(train_df.drop(columns=['ID']),
                     label="Diabetes_binary",
                     cat_features=binary_features)
from deepchecks.tabular.checks import ClassImbalance, FeatureLabelCorrelation, FeatureFeatureCorrelation
import anywidget, ipywidgets
########################## Data Validation: Check for class imbalance and anomalous feature-feature or feature-label correlations
### Class imbalance is expected for diabetes prediction, so it is not by itself a warning sign about the dataset
### Feature-label: chose 0.5 as a threshold, since this is variable health and lifestyle data and a high correlation for any single feature would be unexpected
### Feature-feature: watches for multicollinearity; the threshold is set higher because it is reasonable for some features to be more correlated with each other
### Ian Gault: I looked up examples on ChatGPT-5 of how to use deepchecks for class imbalance and correlations and which modules they live in. I found the syntax with Suite and implemented that style here. I was also running into errors with packages being out of sync or missing for deepchecks, so I looked up more information about those errors for debugging purposes.
suite = Suite(
"Validation",
ClassImbalance(),
FeatureLabelCorrelation(correlation_threshold=0.5),
FeatureFeatureCorrelation(correlation_threshold=0.7),
)
suite_result = suite.run(deep_train)
suite_result
[Class Imbalance: {0: 0.86, 1: 0.14},
Feature Label Correlation: {'BMI': 0.025392252368002362, 'MentHlth': 0.00187223805760016, 'PhysHlth': 0.0004214881102744822, 'Age': 2.4552993846964164e-07, 'Education': 2.423832833325499e-07, 'Income': 2.393484367427756e-07, 'CholCheck': 1.9204200058707858e-07, 'HeartDiseaseorAttack': 1.8853309243567383e-07, 'Sex': 1.8634822568287007e-07, 'Smoker': 1.8471139560866802e-07, 'AnyHealthcare': 1.8079021379344183e-07, 'NoDocbcCost': 1.8079021379344183e-07, 'Veggies': 1.78548217815205e-07, 'GenHlth': 1.7781842650920206e-07, 'PhysActivity': 1.7681596424922617e-07, 'DiffWalk': 1.7429470691623626e-07, 'HighBP': 0.0, 'HighChol': 0.0, 'Stroke': 0.0, 'Fruits': 0.0, 'HvyAlcoholConsump': 0.0},
Feature-Feature Correlation: BMI GenHlth MentHlth PhysHlth Age \
BMI 1.0 0.248686 0.048759 0.097948 -0.012271
GenHlth 0.248686 1.0 0.224643 0.459304 0.142618
MentHlth 0.048759 0.224643 1.0 0.295968 -0.164163
PhysHlth 0.097948 0.459304 0.295968 1.0 0.063486
Age -0.012271 0.142618 -0.164163 0.063486 1.0
Education -0.134065 -0.281615 -0.05047 -0.142485 -0.119831
Income -0.099379 -0.351473 -0.131049 -0.231459 -0.180841
HighBP 0.206217 0.294128 0.05726 0.166951 0.344644
HighChol 0.096175 0.207223 0.047914 0.120044 0.270924
CholCheck 0.04488 0.044226 0.005197 0.033418 0.09707
Smoker 0.01004 0.149567 0.065309 0.104804 0.13725
Stroke 0.016781 0.168659 0.072189 0.156608 0.122517
HeartDiseaseorAttack 0.040529 0.253665 0.05552 0.176386 0.230413
PhysActivity 0.135847 0.270157 0.119464 0.226629 0.103556
Fruits 0.088749 0.103484 0.052649 0.032026 0.059362
Veggies 0.063833 0.113243 0.044578 0.052623 0.013829
HvyAlcoholConsump 0.04814 0.035961 0.047216 0.016919 0.050196
AnyHealthcare 0.033011 0.057583 0.05472 0.011582 0.128848
NoDocbcCost 0.0482 0.158769 0.176788 0.155456 0.121746
DiffWalk 0.187383 0.45961 0.228855 0.494396 0.213355
Sex 0.025625 0.016219 0.069534 0.044024 0.018386
Education Income HighBP HighChol CholCheck ... \
BMI -0.134065 -0.099379 0.206217 0.096175 0.04488 ...
GenHlth -0.281615 -0.351473 0.294128 0.207223 0.044226 ...
MentHlth -0.05047 -0.131049 0.05726 0.047914 0.005197 ...
PhysHlth -0.142485 -0.231459 0.166951 0.120044 0.033418 ...
Age -0.119831 -0.180841 0.344644 0.270924 0.09707 ...
Education 1.0 0.448571 0.141847 0.065725 0.003434 ...
Income 0.448571 1.0 0.178816 0.085191 0.019539 ...
HighBP 0.141847 0.178816 1.0 0.065386 0.013644 ...
HighChol 0.065725 0.085191 0.065386 1.0 0.011797 ...
CholCheck 0.003434 0.019539 0.013644 0.011797 1.0 ...
Smoker 0.154603 0.089058 0.007172 0.005763 0.000003 ...
Stroke 0.073483 0.135976 0.01908 0.008409 0.004378 ...
HeartDiseaseorAttack 0.101728 0.140748 0.0384 0.028009 0.003874 ...
PhysActivity 0.190489 0.194654 0.011498 0.003996 0.00019 ...
Fruits 0.096641 0.059901 0.000262 0.000395 0.00099 ...
Veggies 0.133827 0.135437 0.002986 0.000935 0.001046 ...
HvyAlcoholConsump 0.01526 0.050463 0.0 0.0 0.000403 ...
AnyHealthcare 0.130493 0.159388 0.001038 0.001493 0.028118 ...
NoDocbcCost 0.11176 0.219431 0.000052 0.000002 0.007061 ...
DiffWalk 0.198533 0.330611 0.045017 0.017457 0.004597 ...
Sex 0.030295 0.135643 0.002325 0.00143 0.000165 ...
Stroke HeartDiseaseorAttack PhysActivity Fruits \
BMI 0.016781 0.040529 0.135847 0.088749
GenHlth 0.168659 0.253665 0.270157 0.103484
MentHlth 0.072189 0.05552 0.119464 0.052649
PhysHlth 0.156608 0.176386 0.226629 0.032026
Age 0.122517 0.230413 0.103556 0.059362
Education 0.073483 0.101728 0.190489 0.096641
Income 0.135976 0.140748 0.194654 0.059901
HighBP 0.01908 0.0384 0.011498 0.000262
HighChol 0.008409 0.028009 0.003996 0.000395
CholCheck 0.004378 0.003874 0.00019 0.00099
Smoker 0.005266 0.014002 0.004228 0.003194
Stroke 1.0 0.043585 0.005629 0.000001
HeartDiseaseorAttack 0.043585 1.0 0.004961 0.000072
PhysActivity 0.005629 0.004961 1.0 0.016617
Fruits 0.000001 0.000072 0.016617 1.0
Veggies 0.0015 0.001514 0.021121 0.050952
HvyAlcoholConsump 0.002543 0.00308 0.000151 0.000868
AnyHealthcare 0.00024 0.001034 0.00346 0.001229
NoDocbcCost 0.001417 0.000666 0.004101 0.001209
DiffWalk 0.035816 0.044123 0.053121 0.000742
Sex 0.000002 0.007933 0.001792 0.00741
Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost \
BMI 0.063833 0.04814 0.033011 0.0482
GenHlth 0.113243 0.035961 0.057583 0.158769
MentHlth 0.044578 0.047216 0.05472 0.176788
PhysHlth 0.052623 0.016919 0.011582 0.155456
Age 0.013829 0.050196 0.128848 0.121746
Education 0.133827 0.01526 0.130493 0.11176
Income 0.135437 0.050463 0.159388 0.219431
HighBP 0.002986 0.0 0.001038 0.000052
HighChol 0.000935 0.0 0.001493 0.000002
CholCheck 0.001046 0.000403 0.028118 0.007061
Smoker 0.000284 0.014651 0.000023 0.001257
Stroke 0.0015 0.002543 0.00024 0.001417
HeartDiseaseorAttack 0.001514 0.00308 0.001034 0.000666
PhysActivity 0.021121 0.000151 0.00346 0.004101
Fruits 0.050952 0.000868 0.001229 0.001209
Veggies 1.0 0.000193 0.000086 0.000507
HvyAlcoholConsump 0.000193 1.0 0.000004 0.000062
AnyHealthcare 0.000086 0.000004 1.0 0.073258
NoDocbcCost 0.000507 0.000062 0.073258 1.0
DiffWalk 0.004895 0.001377 0.000028 0.019495
Sex 0.003473 0.000016 0.000022 0.004393
DiffWalk Sex
BMI 0.187383 0.025625
GenHlth 0.45961 0.016219
MentHlth 0.228855 0.069534
PhysHlth 0.494396 0.044024
Age 0.213355 0.018386
Education 0.198533 0.030295
Income 0.330611 0.135643
HighBP 0.045017 0.002325
HighChol 0.017457 0.00143
CholCheck 0.004597 0.000165
Smoker 0.011808 0.006184
Stroke 0.035816 0.000002
HeartDiseaseorAttack 0.044123 0.007933
PhysActivity 0.053121 0.001792
Fruits 0.000742 0.00741
Veggies 0.004895 0.003473
HvyAlcoholConsump 0.001377 0.000016
AnyHealthcare 0.000028 0.000022
NoDocbcCost 0.019495 0.004393
DiffWalk 1.0 0.004581
Sex 0.004581 1.0
[21 rows x 21 columns]]
Data Visualization
# Check the imbalanced sample sizes of the two classes
alt.data_transformers.enable('vegafusion')
alt.Chart(train_df, title = "Number of Records of Two Classes").mark_bar().encode(
x = "Diabetes_binary:N",
y = "count()"
).properties(
width=150,
height=250)
# Boxplot for Numeric Features
alt.Chart(train_df).mark_boxplot().encode(
x=alt.X('Diabetes_binary:N', title='Diabetes (0/1)'),
y=alt.Y(alt.repeat('row'), type='quantitative')
).properties(
width=200,
height=150
).repeat(
row=numeric_features,
)
# Those having diabetes (Diabetes_binary = 1) have a higher BMI on average
# Bar Chart of Proportion with Diabetes for Binary Features
alt.Chart(train_df).mark_bar().transform_fold(
binary_features,
as_=['feature', 'value']
).encode(
x=alt.X('value:N', title='0 or 1'),
y=alt.Y('mean(Diabetes_binary):Q', title='Proportion with Diabetes'),
).properties(
width=150,
height=150
).facet(
facet='feature:N',
columns=5
)
# Bar Chart for Ordinal Features
alt.Chart(train_df).mark_bar(size=20).encode(
x=alt.X(alt.repeat("row"),type="quantitative", sort="ascending"),
y="count()",
color="Diabetes_binary:N",
column=alt.Column("Diabetes_binary:N")
).properties(
width=200,
height=150
).repeat(
row=ordinal_features
)
Model Training
Feature Processing
dat.columns
Index(['ID', 'Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits',
'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age',
'Education', 'Income'],
dtype='object')
# features
numeric_feats = ["GenHlth", "Education", "Income", "Age", "MentHlth", "PhysHlth", "BMI"]
passthrough_feats = [
"HighBP",
"HighChol",
"CholCheck",
"Smoker",
"Stroke",
"HeartDiseaseorAttack",
"PhysActivity",
"Fruits",
"Veggies",
"HvyAlcoholConsump",
"AnyHealthcare",
"NoDocbcCost",
"DiffWalk",
"Sex"
]
from sklearn.compose import make_column_transformer
preprocessor = make_column_transformer(
(StandardScaler(), numeric_feats),
("passthrough", passthrough_feats)
)
Dummy Classifier
from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier(strategy="most_frequent", random_state=522)
scores_dummy = pd.DataFrame(cross_validate(dummy_clf, X_train, y_train, return_train_score=True)).mean()
scores_dummy
fit_time 0.009452
score_time 0.001883
test_score 0.860922
train_score 0.860922
dtype: float64
Logistic Regression
lr_pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
scores_logistic = cross_validate(lr_pipe, X_train, y_train, return_train_score=True)
results = pd.DataFrame(scores_logistic)
results.mean()
fit_time 0.165687
score_time 0.004974
test_score 0.863731
train_score 0.863839
dtype: float64
Linear SVC
from sklearn.svm import LinearSVC
linear_svc_pipe = make_pipeline(preprocessor, LinearSVC(max_iter=5000))
scores = cross_validate(linear_svc_pipe, X_train, y_train, return_train_score=True)
results = pd.DataFrame(scores)
results.mean()
fit_time 0.412132
score_time 0.005083
test_score 0.863539
train_score 0.863546
dtype: float64
Final Test (predict on the test set)
from sklearn.metrics import accuracy_score
lr_pipe.fit(X_train, y_train)
prediction_lr = lr_pipe.predict(X_test)
accuracy_lr = accuracy_score(y_test, prediction_lr)
linear_svc_pipe.fit(X_train, y_train)
prediction_svc = linear_svc_pipe.predict(X_test)
accuracy_svc = accuracy_score(y_test, prediction_svc)
print(f"The accuracy of the Logistic Regression model is {accuracy_lr}")
print(f"The accuracy of Linear SVC model is {accuracy_svc}")
The accuracy of the Logistic Regression model is 0.8627207505518764
The accuracy of Linear SVC model is 0.8632726269315674
Conclusion
After training, Logistic Regression and Linear SVC produced similar accuracy on the test set (about 86%), with Logistic Regression training faster. Given the small difference, either model could be chosen for further evaluation; if speed, interpretability, or probability estimates are important, Logistic Regression is the natural choice.
A higher-priority next step is addressing class imbalance and re-evaluating both models to see if they outperform the dummy classifier. This motivates deeper EDA, examining feature distributions and predictions, reviewing confusion matrices, and conducting hyperparameter tuning to test for potential improvements. At this point, we cannot draw firm conclusions about the models’ predictive ability based on the current dataset and features.
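As a sketch of that re-evaluation, the snippet below refits logistic regression with class_weight="balanced" and compares recall on the minority class. The data is synthetic (an assumed stand-in for the BRFSS features generated with `make_classification`), so the exact numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data mimicking the ~86/14 class split
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.86, 0.14], random_state=522)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=522)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall on the minority (positive) class, before and after reweighting
rec_plain = recall_score(y_te, plain.predict(X_te))
rec_bal = recall_score(y_te, balanced.predict(X_te))
print(rec_plain, rec_bal)
```

Balanced class weights shift the decision boundary toward the minority class, typically trading a little accuracy for substantially higher recall, which is the relevant trade-off for a screening tool.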